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Abstract 

Background: Change point problems arise in many genomic analyses such as the detection of copy number 
variations or the detection of transcribed regions. The expanding Next Generation Sequencing technologies now 
allow to locate change points at the nucleotide resolution. 

Results: Because of its complexity which is almost linear in the sequence length when the maximal number of 
segments is constant, and as its performance had been acknowledged for microarrays, we propose to use the Pruned 
Dynamic Programming algorithm for Seq-experiment outputs. This requires the adaptation of the algorithm to the 
negative binomial distribution with which we model the data. We show that if the dispersion in the signal is known, 
the PDP algorithm can be used, and we provide an estimator for this dispersion. We describe a compression 
framework which reduces the time complexity without modifying the accuracy of the segmentation. We propose to 
estimate the number of segments via a penalized likelihood criterion. We illustrate the performance of the proposed 
methodology on RNA-Seq data. 

Conclusions: We illustrate the results of our approach on a real dataset and show its good performance. Our 
algorithm is available as an R package on the CRAN repository. 

Keywords: Segmentation algorithm. Exact algorithm. Fast algorithm, RNA-Seq data. Genome annotation. Count 
data. Data compression 



Background 

Change-point detection methods have long been used in 
the analysis of genetic data as an efficient tool in the study 
of DNA sequences for various purposes. For instance, seg- 
mentation methods have been developed for categorical 
variables with the aim of identifying patterns for gene 
predictions [1,2], while SNPs have been detected using 
sequence segmentation [3]. In the last two decades, with 
the large spread of micro-arrays, change-point methods 
have been widely used for the analysis of DNA copy num- 
ber variations and the identification of amplification or 
deletion of genomic regions in pathologies such as cancer 
[4-8]. 

The recent development of Next-Generation Sequenc- 
ing technologies gives rise to new applications along with 
new difficulties: (a) the increased size of profiles (up to 10^ 
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data-points when micro-array signals were closer to 10^), 
and {b) the discrete nature of the output (number of reads 
starting at each position of the genome). Yet applying seg- 
mentation methods to DNA-Seq data with their greater 
resolution should lead to the analysis of copy-number 
variation with a much improved precision compared to 
CGH arrays. Moreover, in the case of poly-(A) RNA- 
Seq data on lower organisms, since coding regions of the 
genome are well separated from non-coding regions with 
lower activity, segmentation methods should allow the 
identification of transcribed genes as well as address the 
issue of new transcript discovery. Our objective is there- 
fore to develop a segmentation method to tackle both 
(a) and (b) with some specific requirements: the amount 
of reads falling within a segment should be represen- 
tative of the biological information associated (relative 
copy-number of the region, relative level of expression of 
the gene), and comparison to neighboring regions should 
be sufficient to label the segment (for instance normal 
or deleted region of the chromosome in DNA-Seq data, 
exon or non-transcribed region in RNA-Seq), therefore 
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no comparison profile should be needed. This also sup- 
presses the need for normalization, and consequently we 
wish to analyze the raw count-profile. 

Up to now, most methods addressing the analysis of 
these datasets require some normalization process to 
allow the use of algorithms which rely on Gaussian- 
distributed data or which were previously developed for 
micro-arrays [9-12]. Indeed, methods adapted to count 
datasets are not numerous and are highly focused on the 
Poisson distribution. Alteration of genomic sequences can 
be detected based on the comparison of Poisson processes 
associated with the read counts of a case and a control 
sample [13], but this cannot be applied to the detection of 
transcribed regions in a single condition. 

Still, a likelihood ratio statistic was proposed for the 
localization of a shift in the intensity of a Poisson process 
[14], and a test statistic was proposed for the existence of a 
change-point in the Poisson autoregression of order 1 [15]. 

These last two methods do not require a comparison 
profile but they only allow for the detection of a single 
change-point and have too high a time-complexity to be 
applied to RNA-Seq profiles. Binary Segmentation, a fast 
heuristic [6], and Pruned Exact Linear Time (PELT) [16], 
an exact algorithm for optimal segmentation with respect 
to the likelihood, are both implemented for the Poisson 
distribution in the changepoint package. Even though 
both are extremely fast, do not require a comparison pro- 
file, and analyze count-data, the Poisson distribution is not 
adapted to our kind of datasets. 

A recent study [17] has compared 13 segmentation 
methods for the analysis of chromosomal copy number 
profiles and has shown the excellent performance of the 
Pruned Dynamic Programming (PDP) algorithm [18] pro- 
posed in its initial implementation for the analysis of 
Gaussian data in the R package cghseg. We propose to 
use this algorithm, which we have implemented for the 
Poisson and negative binomial distributions. 

In the next section we recall the general segmentation 
framework and the definition and requirements of the 
PDP algorithm. Our contributions are given in the third 
section where we define the negative binomial model and 
show that it satisfies the PDP algorithm requirements. 
We also provide a theoretical result for the possibility to 
compress the data, and finally we give a model selection 
criterion with theoretical guarantees, which makes the 
whole approach complete. We conclude with a simulation 
study, which illustrates the performance of the proposed 
method. 

Segmentation model and algorithm 

General segmentation model 

The general segmentation problem consists in partition- 
ing a signal of n data-points {j^l^eji,^]] into a given number 
K of pieces or segments. The model can be written as 



follows: the observed data {yt]t^i,...,n are supposed to be 
a realization of an independent random process Y = 
{Yt}t=i,...,n' This process is drawn from a probability dis- 
tribution Q which depends on a set of parameters among 
which one parameter 0 is assumed to be affected by /C — 1 
abrupt changes, called change-points, such that 

Yt ~ Q(Ory 0) if ^ G r and rem 

where m is a partition of [[ 1, into segments r. Or stands 
for the parameter of segment r and (p is constant. The 
objective is to estimate the change-points or the posi- 
tions of the segments and the parameters Or both resulting 
from the segmentation. More precisely, we define 
the set of all possible partitions in k > 0 regions of the 
sequence up to point t We recall that the number of 
possible partitions is 



c^rd(Mk,t) 



We aim at choosing the partition in Mx^n of minimal 
loss y, where the loss is usually taken as the nega- 
tive log-likelihood of the model. We define the point- 
additive loss of a segment with given parameter 0 as 
c(r, 0) = J2i e rV therefore its optimal cost is c(r) = 

mine {c(r,0)}. This allows us to define the cost of a seg- 
mentation m as ^ ^ c(r) and our goal is to recover the 
optimal segmentation Mk^^ and its cost C/^,^ which are 
particular cases of the generic optimal segmentation of the 
signal up to point tink segments and its cost, defined as: 

^k,t = argmin^^ ^ ^^^j j 

ir e m 

and Ck,t = mm{m e Mk,t} \ XI ^^^^ 



Quick overview of the PDP algorithm 

Like the original DP algorithm, the pruned DP algorithm 
is an iterative algorithm based on the minimization of a 
cost function Ck,t which is traditionally decomposed as: 

Ck,t= min +min[c([r + l,^],6>)] (1) 

{k-i<T<t] [ e J 

where 0 is the parameter of the cost of the last segment, 
constraints on its possible values being directly related to 
the support of the loss function y (for instance 0 takes its 
value in M in the case of the Gaussian loss, but in [0, 1] 
in the case of the binomial loss). In what follows we will 
denote by the set of possible values for parameter 0, 

The specificity of the PDP algorithm is that it relies on 
the comparison of candidates for the last change-point 
position r through the permutation of the minimizations 
in (1) and the introduction of the functions: 



Hk,tiO) = min 

k-l<x<t 



[Q-l,r + c([r + l,^],^)}. 
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which are the cost of the best partition in k regions up to 
t, the parameter of the last segment being 0. Ck,t is then 
obtained as mine {H]^^t{0)}. 

Then at each iteration /c, the PDP algorithm works on 
a list of last change-point candidates: ListCandidate^^. For 
each of these rs and for each value of U it updates the set 
of ^s, denoted 5'^^^ for which this candidate is optimal If 
this set is empty, the candidate is discarded, resulting in 
the pruning and lower complexity of the algorithm. 

The foundations of the algorithm can be written as 
follows. 

• DeimmgHl^iO) = + Z^r^i ViypO) the 
optimal cost if the last change is r and last parameter 
is 0, then 

(i) is obtained homH'^^iO) using: 

• Deimmgll^ = [o \ Hl/0) < Hl/0) ] = 

|6> I Hl^(0) < Ck-i,t I the set of 0 such that r is 
better than t in terms of cost, with x < then 

(ii) if all X!/=r+i y^i' ^) unimodal in 0 then /^^ 
are intervals. Indeed, since by definition 
Hj^^(0) = Ck-i,t and the cost function does not 
depend on ^, 7^^ is the set of values for which a 
unimodal function is smaller than a constant. 

• Finally, we introduce .SJ^^ = \yO \ < Hk,t{0) | 
the set of 6 such that r is optimal. Then since 
H]^^t{6) = minT:<^ |//^^(^)|, S^^ can be written as 

0 \HI^(0) = Hk,tiO) I and we obtain that 

(iii) can be updated using: 

. cr _ cr Pi i-r 
* = h \ (UrGListCandidateA:^^,^) 

The first assertion follows from the fact that 

[0\HI^, < min (hI%. min,<,<,//,^,^^) ) , 
the first term in the minimum giving /^^^.j^ and 
the second one giving 5^ ^. The second assertion 
trivially follows from the fact that candidate t is 
optimal on the set of values where no other 
candidate was optimal. 

(iV) once it has been determined that ^ is empty, it 
easily follows from the update equation (///) that 
the region-border r can be discarded from the 
list of candidates ListCandidatey^\ 

Sl^0^ yf>t 51,^0. 



Requirements of the pruned dynamic programming 
algorithm. 

Proposition O.l. Properties (i) to (iv) are satisfied as soon 
as the following conditions on the loss c(r, 0) are met: 

(a) It is point additive, 

(b) It is convex with respect to its parameter 0, 

(c) It can be stored and updated efficiently. 

The proof of those claims can be found in [18]. 
A pseudo-code of the PDP algorithm is given in the 
appendix. 

It is possible to include an additional penalty term, 
denoted g as in the pseudo-code, in the loss function. To 
preserve the point- additivity requirement of the loss, this 
penalty can only depend on the value of the segment- 
parameter 0 and not on any other characteristics, such 
as segment length. This is then equivalent to minimizing 
Ck,t = mn{k-i<T<t} {Q-i,r +min^ {c([r + l,t],0) +^(6>)}} 
and can be achieved by adding the penalty value g(0) in 
the initialization of H^^(0). For example, in the case of 
RNA-seq data one could add a lasso (A. |^ |) or ridge penalty 
(kO^) to encode that a priori the coverage in most regions 
should be close to 0. Our C++ implementation of the PDP 
algorithm includes the possibility of adding such a penalty 
term; however we do not provide an R interface to this 
functionality in our R package. One of the reasons for this 
choice is that choosing an appropriate value for A is not a 
simple problem. 

Contribution 

Pruned dynamic programming algorithm for count data 

We now show that the PDP algorithm can be applied to 
the segmentation of RNA-Seq data using a negative bino- 
mial model and we propose a criterion for the choice of /C 
Though not discussed here, our results also hold for the 
Poisson segmentation model. 



Negative binomial model. We consider that in each seg- 
ment r all jt are the realization of random variables Yt 
which are independent and follow the same negative bino- 
mial distribution. Assuming the dispersion parameter 0 
to be known, we will use the natural parametrization 
from the exponential family (also classically used in R) 
so that parameter Or will be the probability of success. In 
this framework. Or is specific to segment r whereas 0 is 
common to all segments. 

Wehave£(n) = 0(l-6>)/6> and y^r(n) = 0(l-6>)/6>2. We 
choose the loss as the negative log-likelihood associated 
with data-point t belonging to segment r: — 01og(^r) — 

log(l - Or) + A{(t),yt), or more simply yiyuOr) = -0 
log(Or) — yt log(l — Or) since A is a function that does not 
depend on Or. 
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Validity of the pruned dynamic programming algo- 
rithm for the negative binomial model 
Proposition Assuming parameter 0 to be known, the 
negative binomial model satisfies (a), (b) and (c): 

(a) As we assume that Yt are independent, we indeed 
have that the loss is point additive: c(r, 0) = 

(b) As yiyty 0) = —0 log(^) — yt log(l — 0) is convex 
with respect to 0, c(r, 0) is also convex as the sum of 
convex functions. 

(c) Finally, we have c(r,0) = —HrCp log(^)+ 

J2t eryt log(l — 0) (where nr is the length of 
segment r). This function can be stored and updated 
using only two doubles: one for —nr(p, say di, and the 
other for ^ ^j/^, say (^2- Then at step ^ + 1 as the 
new datapoint y^+i is considered, these doubles are 
simply updated as /ii ^ di+cp and d2 ^ d2 . 

Estimation of the overdispersion parameter. We pro- 
pose to estimate (p using a modified version of Johnson 
et. al's estimator [19]: compute the moment estimator of 
0 on each sliding window of size h using the formula 
0 = E(Yf/(Var(Y) - E(Y)) and keep the median 0. 

Taking into account a positional bias. It is possible that 
the assumption that the counts share the same distribu- 
tion in a segment might not be verified. For instance in 
the case of RNA-Seq data the number of reads can be 
affected by the location in the transcribed region or by the 
GC-content of the fragment. The pruned dynamic pro- 
gramming algorithm only requires a vector of integers as 
input, it is therefore possible to apply any kind of normal- 
ization process that preserves the count-specificity of the 
data prior to segmentation. For instance, a method such as 
that which has resulted in the publication of the data used 
in the illustration [20] can be applied. A comparison of the 
main normalization methods can for example be found in 
BuUard et, al/s paper [21]. 

C++ implementation of the PDP algorithm 

We implemented the PDP algorithm in C++ having in 
mind the possibility of adding new loss functions in poten- 
tial future applications. The difficulties we had to face 
were the versatility of the program to be designed and the 
design of the objects it could work on. Indeed, the use of 
full templates implied that we used stable sets of objects 
for the operations that were to be performed. 
Namely: 

• The sets were to be chosen in a tribe. This means that 
they all belong to a set S of sets such that any set 
s e S can be conveniently handled and stored in the 



computer. A set of sets S is said to be acceptable if it 
satisfies the following: 

1. If s belongs to SyR\s e S 

2. If 51, 52 G <S, 51 0 52 G 5 

3. If 51, 52 e Sy 51 U52 G 5 

For instance, the set S of intervals is a tribe since the 
complementary, the union and the intersection of 
intervals form a union of intervals. This property 
ensures that the sets ^ and ^ can be updated and 
stored efficiently (only two doubles are required to 
store an interval) to take full advantage of the pruning 
process. 

• The cost functions were chosen in a set T such that 

1. Each function may be conveniently handled and 
stored by the software. 

For instance, for the Gaussian loss it suffices to 
store the three coefficients of a second order 
polynomial. 

2. For any/ G and any constant c,f(x) < c can be 
easily solved and the set of solutions belongs to an 
acceptable set of sets 

3. Forany/,^G 7",/+^ G J". 

These two points ensure that the cost (and 
penalty) functions can be easily updated and 
compared so that the sets ^ of each candidate r 
can be updated and candidates eventually 
discarded. 

Thus we defined two collections for the sets of sets S, 
intervals and parallelepipeds, and implemented the loss 
functions corresponding to negative binomial, Poisson or 
normal distributions. The program is thus designed in 
a way that any user can add his own cost function or 
acceptable set of probability function and use it without 
rewriting a line in the code. 

Compression of the signal 

In the case of count data, and in particular in the analysis 
of RNA-Seq data, it is very likely that we observe plateaux, 
that is regions between two arbitrary positions t\ and t2 
{> t\) where the signal is constant: 

V^, t\<t< t2, yt = yti = yt2' 

Then we have the following proposition, the proof of 
which is given in the appendix. 

Proposition 0.3. There exists a segmentation m in K or 
fewer segments without any change-point in the plateaux 
such that the optimal cost ofm is equal to Cx^n- 

This proposition proves the arguably intuitive idea 
that having a change-point between ti and t2 is never 
beneficial in terms of cost. When searching for the best 
segmentation of the data, it is therefore unnecessary to 
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look for change-points in plateaux. In other words a 
plateau starting at position ti and ending at position t2 can 
be considered as a unique data point with value yt^ and 
weight ^2 — ^1 + 1- At worst the size of the compressed sig- 
nal is equal to the minimum between two times the num- 
ber of reads and the length of the chromosome arm. Thus, 
if the number of reads is very large, the two-step algorithm 
(compression and pruned dynamic programming) does 
not change the worst case complexity. However, in most 
cases the number of reads is much smaller than the size 
of the considered chromosome. Thus compression is effi- 
cient and allows for a significant reduction in the overall 
run-time. Furthermore, in the case of RNA-Seq data we do 
not expect reads to be evenly scattered. On the contrary 
they are concentrated in transcribed regions and between 
those regions we expect large plateaux of 0 allowing for an 
efficient compression (for instance only 2% of the human 
chromosome contains coding regions). 

Model selection 

The last issue concerns the estimate of the number of seg- 
ments K, This model selection issue can be solved using a 
penalized log-likelihood criterion for which the choice of 
a good penalty function is crucial. This kind of procedure 
typically requires computation of the optimal segmen- 
tations in all /: = 1, . . . , /<max segments where /<max is 
generally chosen smaller than n. The most popular cri- 
teria (AIC [22] and BIC [23]) failed in the segmentation 
context due to the discrete nature of the change-points. 
Indeed, additionally to being an asymptotic criterion in a 
framework where the collection of possible models grows 
polynomially with n, the BIC criterion uses a Laplace 
approximation requiring differentiability conditions of the 
likelihood function which are not satisfied by the seg- 
mentation model [24]. From a non-asymptotic point of 
view and for the negative binomial model, the follow- 
ing criterion was proposed [25]: denoting mx the optimal 
segmentation of the data in K segments. 



K= arg min 

/<rGl:/<max 



EE 



+ ^/C (^1+4^1.1 + log (^)^ 



0lQg . , - -J^log 1 

2 



\ 4>+yrJ. 



(2) 



where yr = — — and hr is the size of segment r. The 

first term corresponds to the cost of the optimal segmen- 
tation while the second is a penalty term which depends 
on the dimension K and on a constant ^ that has to be 
tuned according to the data (see the next section). With 
this choice of penalty, a so-called oracle penalty, the result- 
ing estimator satisfies an oracle-type inequality. A more 
complete performance study is done in [25] and showed 
that the proposed criterion outperforms the existing ones. 



Implementation 

The Pruned Dynamic Programming algorithm is available 
in the function Segment or of the R package Segmen- 
torSIsBack. Version 1.7 of this package contains the com- 
pression process which is performed by default in the case 
of count data. The user can choose the distribution with 
the slot mode 1 (1 for Poisson, 2 for Gaussian homoscedas- 
tic, 3 for negative binomial and 4 for segmentation of 
the variance). It returns an S4 object of class Segmen- 
tor which can later be processed for other purposes. The 
function SelectModel provides four criteria for choos- 
ing the optimal number of segments: AIC [22], BIC [23], 
the modified BIC [24] (available for Gaussian and Poisson 
distribution) and oracle penalties (available for the Gaus- 
sian distribution [26] and for the Poisson and negative 
binomial [25] as described previously). This latter kind of 
penalty requires tuning a constant according to the data, 
which is done using the slope heuristic [27]. 

Figure 1 (which is detailed in the Results and discussion 
Section) was obtained with the following 4 lines of code 
(assuming the data was contained in vector x): 

Seg< -Segment or (x, model =3 , Kmax=2 00) 

Kchoose< -SelectModel (Seg, penalty= 
"oracle" ) 

plot (sqrt (x) , col= ' dark red') 

abline (v=getBreaks (Seg) [Kchoose, 
1 : Kchoose] , col= ' blue ' ) 

The function Best Segment at ion allows us, for a 
given /C, to find the optimal segmentation with a change- 
point at location t (slot $bestSeg). It also provides, 
through the slot $ best Cost, the cost of the optimal 
segmentation with t for change-point. Figure 2 (Left) 
illustrates this result for the optimal segmentations in 4 
segments of a signal simulated with only 3 segments. We 
can see for instance that any choice of first change-point 
location between 1 and 2000 yields almost the same cost 
(the minimum is obtained for t = 1481), and thus the 
optimal segmentation is not clearly better than the next 
best segmentations. On the contrary, the same function 
with 3 segments shows that the optimal segmentation 
outperforms all other segmentations in 3 segments. 

Results and discussion 

Performance study 

We designed a simulation study on the negative bino- 
mial distribution to assess the performance of the PDP 
algorithm in terms of computational efficiency without 
using the compression option, while studying the impact 
of the overdispersion parameter 0 by comparing the 
results for two different values of this parameter. After 
running different estimators (median on sliding windows 
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Figure 1 Segmentation of the yeast ciiromosome 1 using the negative binomial loss. The model selection procedure chooses K = ]25 
segments, most of which correspond to the official annotation, with segments corresponding to transcribed regions surrounding official genes. 



of maximum, quasi-maximum likelihood and moment 
estimators) on several real RNA-Seq data (whole chro- 
mosome and genes of various sizes), we fixed 0i = 0.3 
as a typical value for highly dispersed data as observed 
in real RNA-Seq data and chose 02 = 2.3 for compari- 
son with a reasonably dispersed dataset. For each value, 
we simulated datasets of size n with various densities of 
number of segments K, and only two possible values for 
the parameter pj: 0.8 on even segments (corresponding to 
low signal) and 0.2 on odd segments for a higher signal. 
We had n vary on a logarithmic scale between 10^ and 



10^ and K between ^/6 and A/n/3. For each configura- 
tion, we segmented the signal up to /Cmax = twice: 
once with the known value of 0 and once with our esti- 
mator 0 as described above. We started with a window 
width /z = 15. When the estimate was negative, we dou- 
bled h and repeated the experience until the median was 
positive. 

Each configuration was simulated 100 times. 

For our analysis we checked the run-time on a standard 
laptop, and assessed the quality of the segmentation using 
the Rand Index X. Specifically, let Q be the true index 



o 

S 8 
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1st change 
2nd change 
3rd change 



1000 2000 3000 4000 5000 6000 
t 
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Figure 2 Cost of optimal segmentation in 4 and 3 segments. Cost of optimal segmentation depending on tine location of the change-point 
when the number of segments is 4 (Left) and 3 (Right) and the signal was simulated with 3 segments. Illustration of the output of function 
Best Segment at ion. 



Cleynen etal. Algorithms for Molecular Biology 2014, 9:6 
http://www.alnnob.Org/content/9/l/6 



Page 7 of 1 1 



of the segment to which base t belongs and let Q be the 
index estimated by the method, then 



1 = 



2E5=1 J2t>s [1q=q1q=q + IQt^qIq^q] 



(n-l)(n-2) 

Figure 3 shows, for the particular case of /C = ^/3, the 
almost linear complexity of the algorithm in the size n of 
the signal As the maximal number of segments /Cmax con- 
sidered increased with n, we normalized the run-time to 
allow comparison. This underlines an empirical complex- 
ity smaller than ©(/Cmax^ logn), and independent of the 
value of 0 or its knowledge. Moreover, the algorithm, and 
therefore the pruning, is faster when the overdispersion 
is high, a phenomenon already encountered with the 
loss when the distribution of errors is Cauchy. However, 
the knowledge of the true value of 0 does not affect the 
run-time of the algorithm. Figure 4 illustrates through the 
Rand Index the quality of the proposed segmentation for a 
few values of n. Even though the indexes are slightly lower 
for 01 than for 02 (see left panel), they range between 
0.94 and 1 showing a great quality in the results. More- 
over, the knowledge of 0 does not increase the quality (see 
right panel), which validates the use of our estimator. We 
can therefore conclude that the run-time of our algorithm 
without compression is roughly 40 x Kmax x n/lO^s, 

Yeast RNAseq experiment 

We applied our algorithm to the segmentation of chromo- 
some 1 of the S. Cerevisiae (yeast) using RNA-Seq data 




Oe+OO 2e+05 4e+05 6e+05 8e+05 1e+06 
signal size 

Figure 3 Run-time analysis for segmentation with negative 
binomial distribution. This figure displays the normalized (by /(max) 
run-time in seconds of the Segmentor3lsBack package for the 
segmentation of signals with increasing length n, for two values of 
the dispersion (p, and with separate analyses for a known value or an 
estimated value. While the algorithm is faster for more over-dispersed 
data, the estimation of the parameter does not slow the processing. 




I 



estimated phi 

known phi 



Figure 4 Rand Index for the quality of the segmentation. This 
figure displays the boxplot of the Rand Index computed for each of 
the hundred simulations performed in the following situations: 
comparing the values with 0] and 02 when estimated (left figure), 
and comparing the impact of estimating 0i (right figure). While the 
estimation does not decrease the quality of the segmentation, the 
value of the dispersion affects the recovery of the true change-points. 



from the Sherlock Laboratory at Stanford University [20], 
publicly available from the NCBIs Sequence Read Archive 
(SRA, http://www.ncbi.nlm.nih.gov/sra, accession num- 
ber SRA048710). We selected the number of segments 
using our oracle penalty described in the previous section. 
An existing annotation of translated regions {Le, exclud- 
ing un- translated regions (UTR)) is available on the 
Saccharomyces Genome Database (SGD) at http://www. 
yeastgenome.org, which allows us to validate our results. 

With a run-time of 27 minutes without compression, 
and 5.4 minutes with compression (for a signal length 
of 230218), we selected 125 segments with the negative 
binomial distribution. Most of those segments (all but 
3) can be related to the official annotation, however as 
expected segments corresponding to transcribed regions 
(as opposed to intergenic regions) were found to surround 
known genes from the SGD due to the difference between 
transcribed and translated regions. Figure 1 illustrates the 
result. 

We compared our segmentation with that correspond- 
ing to the SGD annotation through the Hellinger distance 
by fitting a negative binomial distribution on each seg- 
ment and repeated this comparison with the other two 
algorithms able to process long count datasets: PELT [16] 
and Binary Segmentation [6], both implemented in the R 
package changepoint for the Poisson distribution. For 
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fair comparison, we also used the PDP algorithm for the 
Poisson loss. Figure 5, together with Table 1 which gives 
the estimated number of segments, the overall Hellinger 
score {J2t^t/t^) and the number of change-points falling 
within annotated translated regions, illustrates the result 
and shows that we outperform the other approaches. 
Moreover, most of the Hellinger peaks observed can 
be explained by the fact that we are comparing the 
annotation of transcribed regions with that of translated 
regions. 

Analysis of complex organisms 

The issues raised in the analysis of RNA-Seq and DNA- 
Seq data differ. In the first case, the number of seg- 
ments that we hope to select is roughly twice the number 
of expressed exons, therefore the order of Kmax varies 
from 10^ (small chromosomes from lower organisms, e.g. 
yeast) to 10"^ (large chromosomes from higher organ- 
isms, e.g. human). However, when aligned to a refer- 
ence genome, RNA-Seq data is expected to present large 
plateaux of zeros at non-coding regions (for instance, 98% 
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Table 1 Comparison of algorithm performance on real data 



Algorithm 


Number of 
segments 


Hellinger 
score 


False 
positives 


PDPA- negative binomial 


125 


0.0120 


39 


PDPA- Poisson 


106 


0.0187 


77 


PELT 


3416 


0.0188 


3003 


Binary Segmentation 


2408 


0.0151 


2072 



Overall Hellinger score of each of the segmentation algorithms, and number of 
estimated change-points falling within regions annotated as translated (thus 
considered as false positives). 



of the human genome) and at non expressed regions. 
The compression option of our algorithm then allows 
us to reduce the size of the profile by a factor of 10 to 
10^. Moreover, it is well-known that centromere regions 
are large non-coding regions where no change-point is 
expected, and we therefore propose to divide the profile 
into two parts at such regions. As a proof of concept we 
ran our algorithm with compression on an RNA-seq pro- 
file of the small arm of the 4th chromosome oi Arabidopsis 
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Figure 5 Comparison of proposed segmentation with annotation and PELT and Binary Segmentation algorithms. Each figure gives the 
Hellinger distance between the estimated segmentation of the algorithm (top: PDP with negative binomial, then: PDP with Poisson, third: PELT, and 
bottom: Binary segmentation) and that of the SGD.The PDP algorithm with the negative binomial distribution seems to outperform other algorithms. 



Cleynen etal. Algorithms for Molecular Biology 2014, 9:6 
http://www.almob.0rg/content/9/l/6 



Page 9 of 1 1 



Thaliana (n = 4.10^, Kmax = 6.10^) and selected 4289 
segments after a compression factor of 10 and a run-time 
of 19 hours on a 2.4Ghz computer. The data was kindly 
provided by some of our collaborators. 

DNA-Seq data on the other hand will present much 
smaller plateaux. While this implies that the compression 
will be less efficient, the profile can still be summa- 
rized into a dataset the length of which will be smaller 
than the total amount of mapped reads. Most impor- 
tantly, in these experiments the expected number of 
segments is drastically smaller as the number of chro- 
mosomic aberrations is generally limited to less than one 
hundred per chromosome, even in pathologies such as 



Conclusion 

Segmentation has been a useful tool for the analysis 
of biological datasets for a few decades. We propose 
to extend its application with the use of the Pruned 
Dynamic Programming algorithm for count datasets such 
as outputs of sequencing experiments. We show that 
the negative binomial distribution can be used to model 
such datasets on the condition that the overdispersion 
parameter is known and have proposed an estimator of 
this parameter that performs well in our segmentation 
framework. 

We propose to choose the number of segments using 
our oracle penalty criterion, which makes the package 
fully operational. This package also allows the use of other 
criteria such as AIC or BIC. Similarly, the algorithm is 
not restricted to the negative binomial distribution but 
also allows the use of Poisson and Gaussian losses for 
instance and could easily be adapted to other convex 
one-parameter losses. 

With its empirical complexity of 0(Kmsixf^^ogn), it 
can be applied to large signals such as read-alignment 
of whole chromosomes, and we illustrated its result 
on a real dataset from the yeast genomes. Moreover, 
this algorithm can be used as a base for further anal- 
ysis. For example, [28] use it to initialize their Hid- 
den Markov Model to compute change-point location 
probabilities. 



Availability and requirements 

• Project name: SegmentorSIsBack 

• Project home page: http://cran.r- projectorg/web/ 
packages/SegmentorSIsBack/index.html 

• Operating systems: Platform independent 

• Programming language: C++ code embedded in 
package 

• License: GNU GPL 

• Any restrictions to use by non-academics: none 



Appendix 

Pseudo-code of the PDP algorithm 



Algorithm 1 The PDP algorithm 



Input: yi a sequence of n data-points, K an integer 

Is a set of possible values for 9 

y the loss function, g a penalty function 
Output: Ck,t a matrix of floats of size K n 

Mk,t a matrix of integers of size K n 

Initialize 

For te{l,...n] 

Cu = mineeis {HU viyhO) -\-g{e)] 

Mi,t = 0 
End For 

Main 

For^G {2,...K} 

ListCandidateyt = {k — 1} 

nk-l _r 

'^k.k-l ~ ^5 
Mk,k = k-l 
For t e {k,...n} 

For r € ListCandidate/^ 

^h = [m/o)^^Ck-u+gie)] 
^k,t = ^k,t-i ^ ^k,t 

EndFor 

Ck,t = ITlinTeListCandidateA: {min^e/, //J^^(6') j 

Mk^t = argmin^eListCandidate^ jmin^g/^ 
For r G ListCandidateyt 

If[S|, = 0] 

ListCandidate^t = ListCandidatey^^ \ {r} 

Endlf 
EndFor 

^kt ~ ^^"^ (UreListCandidatCjt-^^^) 

ListCandidateyt = ListCandidate^^^ U {t} 

Hl^{0) = Ck-i,t+g{e) 

Endlf ' 
EndFor 
EndFor 



Proof of Proposition 0.3 

Searching for one change-point Let us first consider a 
segmentation in 2 segments with a breakpoint at t. We 
define PtiOi, O2), the cost of this segmentation given some 
parameter Oi for the first segment and O2 for the second 
segment: 

t n 



i=i 



i=t+l 



The optimal cost Pt is: 



Pt = minoi 



minen 



i=l 



i=t-\-l 



Having these notations, let us prove the following 
lemma: 
Lemma 0*4* 

• Ifti = landt2 = nthenWtPt>Ci,n 

• Ifti = 1 and t2 < n then W ti — 1 < t < t2 we have 

Pt > Pt2 
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• Ifti > I and t2 = n then ti — 1 < t < t2we have 
Pt>Pt,-i 

• Ifti > 1 and t2 < n then W ti — 1 < t < t2 we have 
Pt>min{Pt,-uPt2} 

Proof 

First scenario [ti = 1 and ^2 = ^] We have: 

Thus we get: Pt > Ci,^. 

Second scenario [ti = 1 and t2 < n] For any t such that 

t < t2we have: 

n 

+ mine ' {t2-t)y{yi,0)+ ^ y(j/,6>) 

Thus we have: 

Pt > t.mine [yiyi^Oi)] + - t).mine [yiyi^O)] 



And we get V ^ < t2 Pt > Pt2' 



J2 y^-^^ 



Third scenario [ti > 1 and t2 = n] We get V — 1 < 

t Pt >Pti-i by reversing the index and using scenario 2. 

Fourth scenario [ti > 1 and ^2 < '^l For any t such that 
ti — 1 < t < t2we obtain: 

ti-l n 
Pt(0l.02) = + J2 ^0^^''^2) 

i=l i=t2+l 

+ {t-ti + l)y(yt,.Oi) + {t2 - t)y{yt,.02). 

Thus, for fixed Oi and O2 and for t e[ti- 1, ^2]. PtiOi, O2) 
is a linear function of t. Thus we obtain that for any Oi and 
O2: 

Pt(0ue2) > min [Pt,-i{ei.e2).Pt2i0i.e2)] > min [Pt,-i.Pt2]^ 

As this is true for any Oi and O2 we get Pt 
min{Pt,-uPt2}^ 



> 



Proof of the main proposition Assume that we have 
a segmentation m in MK,n with a breakpoint in 
a plateau. Then applying lemma 0.4 on the sequence 

{yihe{Tk-i,...rk+i} ^^^^ either be discarded or 

moved to — 1 or t2 without increasing the cost. Thus 



there exists a segmentation in K or fewer segments with- 
out any change-point in the plateau such that its optimal 

cost is CK,n' B 

This theorem is more subtle than we might have thought 
based on our intuition. It does not mean that a change- 
point in a plateau is never optimal but only that it is not 
necessary to have change-points in plateaux to achieve 
optimality. 
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